Towards Improving the Detection Performance in Collaborative Visual Sensor Networks

Visual Sensor Networks (VSNs) exploit the processing and communication capabilities of modern smart cameras to handle a variety of applications such as security, surveillance, and critical infrastructure protection. The performance of tasks in such applications, such as activity recognition and tracking, can be severely affected by the detection module, especially when considering low-cost embedded smart cameras with limited processing capabilities. Hence, this paper presents research towards the development of optimization algorithms and decision-making solutions to improve the detection performance of such VSNs. Specifically, it introduces a probabilistic detection model that can be used to characterize the detection capabilities of cameras, and shows how it can be used to reconfigure VSNs. Experimental as well as simulation results indicate that the proposed solution is able to effectively improve the robustness and overall detection performance of VSNs.


INTRODUCTION
Visual Sensor Networks (VSNs) consist of networked cameras that can communicate to collaboratively monitor an area by detecting, recognizing, and tracking targets, while observing a scene [2]. Modern smart cameras offer advanced sensing and processing capabilities and collaboration capabilities that facilitate their deployment in a wide range of applications ranging from security and surveillance to industrial monitoring and personalized healthcare [14], [15]. The overall performance in these applications relates directly to the detection module capabilities of each camera in the network, since tasks such as recognition and tracking require capturing multiple instances of each target to update its state (speed, velocity, etc.). Hence, it is of key importance to develop models and algorithms that are able to characterize the behaviour of a camera detection module and reconfigure the VSN in order to improve its detection performance.
The majority of existing works assume that the detection module of the cameras in VSNs operates under perfect conditions, and do not take into account the possibility that a camera may fail to detect a target that is located within its Field-of-View (FoV). However, in real-world applications even cameras featuring sophisticated visual sensors and onboard processors for decision-making are inherently error prone, due to the probabilistic nature of detection algorithms that rely on machine learning. This becomes more apparent for low-power and low-cost camera systems [16], [6] that can be integrated into ubiquitous cyber-physical systems but do not have the necessary resources to run demanding state-of-the-art object detection algorithms. Such embedded systems either lower the resolution or run less demanding algorithms, both of which compromise the performance of the object detection module. Nevertheless, there is limited research on dealing with this issue in VSNs.
This paper presents research towards incorporating detection performance as a key metric that can be used to reconfigure VSNs in order to improve their efficiency. To this end, the contribution of this work is twofold. First, it proposes a flexible probabilistic model that can be used to study the impact of degrading detection performance in VSN applications, and also to characterize the detection capabilities of each camera in the network. Second, an optimization algorithm is formulated that utilizes the detection performance achieved by each camera per target in order to set a new pan and tilt angle for each camera that maximizes the overall detection performance of the network. The optimization algorithm maximizes the detection performance for multiple targets rather than only a single target. We show the application of the model and optimization algorithm in an active network of Raspberry-Pi-based pan-tilt smart cameras that monitor targets in the field. We also show, through simulations, how the results are affected by varying numbers of cameras and targets.
The rest of this paper is structured as follows. Section 2 outlines some key areas of emerging research in VSNs. In Section 3 we formally introduce the problem as well as assumptions for the visual sensors, targets, and the proposed detection model. Also, in this section we formulate an optimization algorithm that utilizes detection probability information in order to identify new camera configurations that maximize the overall target detection probability. In Section 4 we present the evaluation results for the proposed model and optimization algorithm both experimentally and through simulations. Finally, Section 5 provides concluding remarks and discusses directions for future work.

RELATED WORK
There has been an increasing amount of emerging research in VSNs towards developing collaborative and distributed vision algorithms, with an emphasis on PTZ cameras [5] and networks [9], as well as dynamic network reconfiguration [17]. For example, [7] and [8] deal with the problem of naivety in static VSNs, where not all cameras observe all targets but need to maintain a state estimate for each target. To address this problem the authors introduce a multi-target information consensus algorithm that handles the issues of naivety as well as estimation errors in tracking and data association. The work in [10] formulates a game-theoretic approach, so that cameras can opportunistically identify time instances where the network can reconfigure in order to meet the tracking requirements of targets. The work in [18] investigates how to model the probability of targets entering or exiting certain areas in order to steer the cameras towards monitoring those areas, whereas in [13] the authors consider a 3D environment where the height of targets plays an important role in the application, and reconfigure the network of cameras based on an activity relevance map. Finally, the authors in [12] propose a reconfiguration scheme for SCNs in order to reduce the uncertainty in the targets' location and movement. Overall, existing models and approaches used in VSNs assume that a visual sensor will always detect a target that is present in its FoV, and do not consider the uncertainty in the detection module of a smart camera. In this paper, we present research towards the development of a probabilistic detection model that captures the behaviour of the detection module and can thus provide additional input to decision-making and dynamic configuration algorithms. Additionally, we formulate an optimization algorithm that can take advantage of probabilistic detection information, such as that provided by the proposed model, in order to maximize the overall detection performance.

PROBLEM DESCRIPTION AND SOLUTION OVERVIEW
We consider an active network consisting of $N_C$ smart camera nodes $i$ that belong to the set $C$ and $N_T$ targets $j$ in the set $T$ that are present in the monitored area. The objective is to configure the pan and tilt angles of the cameras in the network so that the overall cumulative detection probability of all targets is maximized (or, equivalently, the miss-detection probability is minimized), thus effectively maximizing the expected number of detected targets.

Visual Sensor Model
Cameras in the network are considered to be active, in which case they have some degrees of freedom and can adjust their point of view based on collective or local information. For instance, they may move in space, thus changing their location $(x^C_i, y^C_i)$, or they may remain at a specific location but change their pan $\Theta^P_i$ and tilt $\Theta^T_i$ parameters (Fig. 1). Each camera $i$ has a sensing range $R_i$ and is located at a height $H_i$. The camera monitors a specific area, denoted as $F^L_i$, which represents its local (current) FoV and is a subset of the total area that the camera is able to monitor, denoted as $F^G_i$. We assume that a camera can change the pan and tilt angles by a fixed step, so the set of all configurations is finite. Specific values of these parameters correspond to a single configuration $k$ that camera $i$ can have out of its finite set of possible configurations $K_i$. The active configuration for each camera $i$ is specified by a binary variable $x_{ik}$, which is equal to one if the camera employs configuration $k \in K_i$ and zero otherwise. We also assume that the cameras in the network are calibrated so that associations between them can be established, and that geometric information is available so that cameras can localize targets. It is assumed that the location $(x^T_j, y^T_j)$ and the distance $d_{ij}$ of target $j$ from camera $i$ can be determined from the scale and resolution at which the target is detected. Each camera uses the above information to coordinate with other cameras regarding common target views. We assume that targets move within the monitored area with a speed that permits their detection by the cameras.
Figure 1: Camera Detection Model
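As a minimal illustration of the finite configuration set and the binary selection variable described above, the following Python sketch enumerates pan-tilt pairs on a fixed step and builds a one-hot selection vector. The angle ranges, the 30-degree step, and the function name `enumerate_configurations` are assumptions chosen for the example and do not come from the paper.

```python
from itertools import product


def enumerate_configurations(pan_range, tilt_range, step):
    """Return every (pan, tilt) pair in degrees reachable with a fixed step.

    pan_range and tilt_range are inclusive (min, max) integer tuples.
    """
    pan_min, pan_max = pan_range
    tilt_min, tilt_max = tilt_range
    pans = range(pan_min, pan_max + 1, step)
    tilts = range(tilt_min, tilt_max + 1, step)
    return list(product(pans, tilts))


# Hypothetical limits: pan in [-90, 90], tilt in [0, 60], 30-degree step.
configs = enumerate_configurations((-90, 90), (0, 60), 30)

# One-hot vector playing the role of x_ik for a single camera:
# exactly one entry is 1, marking the active configuration.
active_index = 3
x = [1 if k == active_index else 0 for k in range(len(configs))]
```

With these assumed limits the camera has 7 pan values and 3 tilt values, so the set of configurations has 21 elements.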
VSN applications typically involve moving targets that can change their position and viewpoint orientation. This may affect the detection performance of a camera, especially as the distance of a target from the camera increases and its pixel resolution decreases. Camera $i$ can therefore detect the targets inside its local FoV $F^L_i$ primarily depending on how far they are from it, so the probability of detection is based on the sensing range $R_i$ and the local FoV of each camera. Through the following model we attempt to capture how the resolution of the target in the camera image affects the detection probability. The main characteristics of the model are also shown in Fig. 1. First, the global FoV of camera $i$ is segmented into $N_z$ detection zones $Z_{im}$, $m = 1, \dots, N_z$, where zone $N_z$ is the one located farthest from the camera. Depending on its current local FoV $F^L_i$, a camera views a subset of these zones, as shown in Fig. 1. Within each zone, a target can be detected with a range of probabilities. Thus we characterize each zone with an average probability and assume a uniform constant detection probability for simplicity. Hence, when a target is in zone $Z_{im}$ of camera $i$, it is assumed that on average it is detected with probability $P_{im}$. A zone has a higher detection probability if it is closer to the camera location $(x^C_i, y^C_i)$, that is, $P_{im} > P_{in}$ for $m < n$. A camera can establish the zone $Z_{im}$ in which a target $j$ is detected through trigonometry, using the pan and tilt angles of its current configuration $k$. Hence, we define the detection probability of camera $i$ for target $j$ using configuration $k$ as $p_{ijk} = P_{im}$, and subsequently the miss-detection probability is $q_{ijk} = 1 - p_{ijk}$.
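The zone lookup can be sketched in a few lines of Python. This is a simplified 2D (ground-plane, pan-only) illustration rather than the authors' implementation: it ignores tilt and camera height, the zone radii and the FoV half-angle are hypothetical, and only the three zone probabilities (0.9, 0.5, 0.2) are taken from the experimental setup reported later in the paper.

```python
import math

# Average detection probability P_im per zone, nearest zone first
# (values from the 3-zone setup used in the experiments).
ZONE_PROBS = [0.9, 0.5, 0.2]
# Hypothetical outer radius of each zone in metres (assumption).
ZONE_OUTER_RADII = [2.0, 4.0, 6.0]


def in_local_fov(cam_xy, pan_deg, half_angle_deg, target_xy):
    """True if the target bearing lies within the camera's horizontal FoV."""
    dx = target_xy[0] - cam_xy[0]
    dy = target_xy[1] - cam_xy[1]
    bearing = math.degrees(math.atan2(dy, dx))
    # Wrap the angular difference into [-180, 180).
    diff = (bearing - pan_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= half_angle_deg


def detection_probability(cam_xy, pan_deg, half_angle_deg, target_xy):
    """Return p_ijk: the zone probability, or 0.0 if the target is not visible."""
    if not in_local_fov(cam_xy, pan_deg, half_angle_deg, target_xy):
        return 0.0
    distance = math.hypot(target_xy[0] - cam_xy[0], target_xy[1] - cam_xy[1])
    for radius, prob in zip(ZONE_OUTER_RADII, ZONE_PROBS):
        if distance <= radius:
            return prob
    return 0.0  # beyond the sensing range R_i


p = detection_probability((0.0, 0.0), 0.0, 30.0, (3.0, 0.0))
q = 1.0 - p  # miss-detection probability q_ijk
```

A target outside the local FoV or beyond the last zone gets probability zero, which corresponds to a miss-detection probability of one and therefore contributes nothing to the combined detection probability.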

Optimization Algorithm for Configuration Assignment
In order to improve the detection capabilities of a VSN we formulate an optimization algorithm that utilizes the information from the camera detection model of Section 3.1 to select a configuration for each camera $i$ in the network so as to maximize the overall detection probability (and thereby the expected number of observed targets). A key step in this process is to identify all possible configurations and the targets that can be monitored with non-zero probability under each of them. This can be done through a systematic process where, given that the positions of all targets $j$ are known (which can be achieved through wide-view static cameras [12]), each camera $i$ generates all possible configurations $k \in K_i$ and determines the set of targets located within the corresponding local FoV $F^L_i$. We do not focus on the process of identifying camera configurations and target sets, as it goes beyond the scope of this paper and the optimization algorithm can operate either on an exhaustive list of configurations or on one where only a significant subset is present.
The above problem is equivalent to the minimization of the overall miss-detection probability. When multiple cameras cover the same target $j$, the overall detection probability for that target can be found from the product of the miss-detection probabilities as $1 - \prod_{i \in C} \prod_{k \in K_i} q_{ijk}^{x_{ik}}$, provided that the detections are uncorrelated. Hence the algorithm can be formulated as:

\begin{align*}
\max_{x} \quad & \sum_{j \in T} \Bigl( 1 - \prod_{i \in C} \prod_{k \in K_i} q_{ijk}^{x_{ik}} \Bigr) \\
\text{s.t.} \quad & \sum_{k \in K_i} x_{ik} = 1, \quad i \in C, \\
& x_{ik} \in \{0, 1\}, \quad i \in C, \; k \in K_i. \tag{1}
\end{align*}

Notice that the objective of (1) is nonlinear, so the problem cannot be solved with standard solvers. To deal with this issue we transform the problem into an equivalent one based on [3]. Let $2^{-z_j} = \prod_{i \in C} \prod_{k \in K_i} q_{ijk}^{x_{ik}}$. Taking the logarithm of both sides gives $z_j = -\sum_{i \in C} \sum_{k \in K_i} x_{ik} \log_2(q_{ijk})$, $z_j \ge 0$, and the formulation becomes:

\begin{align*}
\min_{x, z} \quad & \sum_{j \in T} 2^{-z_j} \\
\text{s.t.} \quad & z_j = -\sum_{i \in C} \sum_{k \in K_i} x_{ik} \log_2(q_{ijk}), \quad j \in T, \\
& \sum_{k \in K_i} x_{ik} = 1, \quad i \in C, \\
& x_{ik} \in \{0, 1\}, \quad i \in C, \; k \in K_i. \tag{2}
\end{align*}

The new formulation (2) is an integer programming problem whose objective function is composed of separable, monotonically decreasing convex terms $2^{-z_j}$. Following the analysis from [11], each of these terms can be tightly approximated by the convex envelope $\phi(z_j)$ of a number of piecewise linear functions. To this end, let us assume that each term $2^{-z_j}$ is approximated by $L_j$ linear segments with slopes $\alpha_{1,j}, \dots, \alpha_{L_j,j}$ and start-points $\beta_{1,j}, \dots, \beta_{L_j,j}$. Let us also assume that $\beta_{L_j+1,j} = z_j^{\max}$. Because $2^{-z_j}$ is convex and monotonically decreasing, the envelope approximation $\phi(z_j)$ will also be convex and the slopes will have monotonically increasing values: $\alpha_{1,j} < \alpha_{2,j} < \dots < \alpha_{L_j,j}$. Let $\xi_{lj}$, $l = 1, \dots, L_j$, be the part of $z_j$ corresponding to the $l$th linear segment, so that $0 \le \xi_{lj} \le \beta_{l+1,j} - \beta_{l,j}$, $l = 1, \dots, L_j$. Under the assumption that $\xi_{ij} = \beta_{i+1,j} - \beta_{i,j}$, $i = 1, \dots, l - 1$, whenever $\xi_{lj} > 0$, it is true that $z_j = \sum_{l=1}^{L_j} \xi_{lj}$ and also that $\phi(z_j) = \sum_{l=1}^{L_j} \alpha_{l,j} \xi_{lj}$ (up to the additive constant $\phi(0)$, which does not affect the minimizer). In other words, $z_j$ can be replaced by the sum of the variables $\xi_{lj}$, $l = 1, \dots, L_j$, if we can ensure that the solution of the optimization problem will always be such that each $\xi_{lj}$ is nonzero only when the variables $\xi_{ij}$, $i = 1, \dots, l - 1$, have attained their maximum value.
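To make the objective and the logarithmic transformation concrete, the sketch below evaluates the combined detection probability for every configuration assignment of a tiny network by exhaustive search and checks that $2^{-z_j}$ reproduces the product of miss-detection probabilities. Exhaustive search is used here only for illustration and is not the method of the paper, which solves a linearized integer program; the miss-detection values and all function names are made up for the example.

```python
import math
from itertools import product

# q[i][k][j]: miss-detection probability of camera i, configuration k, target j.
# Illustrative values only; q = 1.0 means the target is outside the FoV.
# All entries must be > 0 so that log2 is defined.
q = [
    [[0.1, 1.0], [0.5, 0.5]],  # camera 0, configurations 0 and 1
    [[0.8, 0.1], [0.5, 1.0]],  # camera 1, configurations 0 and 1
]
N_TARGETS = 2


def combined_detection(q, assignment):
    """Per-target detection probability 1 - prod_i q[i][k_i][j]."""
    result = []
    for j in range(N_TARGETS):
        miss = 1.0
        for i, k in enumerate(assignment):
            miss *= q[i][k][j]
        result.append(1.0 - miss)
    return result


def z_values(q, assignment):
    """z_j = -sum_i log2(q[i][k_i][j]) for the chosen configurations."""
    return [
        -sum(math.log2(q[i][k][j]) for i, k in enumerate(assignment))
        for j in range(N_TARGETS)
    ]


def best_assignment(q):
    """Exhaustively pick one configuration per camera maximizing the sum."""
    choices = [range(len(cam)) for cam in q]
    return max(product(*choices), key=lambda a: sum(combined_detection(q, a)))


best = best_assignment(q)
best_value = sum(combined_detection(q, best))
z = z_values(q, best)
```

For these values the best assignment is configuration 0 for both cameras, with combined probabilities 0.92 and 0.90 for the two targets. The number of assignments grows as the product of the configuration counts, which is why a solver-friendly reformulation is needed for realistic network sizes.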
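As mentioned earlier, $\alpha_{1,j}$ is the smallest slope, and hence $\xi_{1j}$ will be the first variable associated with $z_j$ to be assigned a nonzero value. Only when $\xi_{1j}$ has been assigned its maximum value will $\xi_{2j}$ be assigned a nonzero value, and this procedure continues until $z_j$ becomes equal to the sum of the nonzero variables. Thus, the assumption stated above is satisfied and formulation (2) becomes:

\begin{align*}
\min_{x, \xi} \quad & \sum_{j \in T} \sum_{l=1}^{L_j} \alpha_{l,j} \xi_{lj} \tag{3a} \\
\text{s.t.} \quad & \sum_{l=1}^{L_j} \xi_{lj} = -\sum_{i \in C} \sum_{k \in K_i} x_{ik} \log_2(q_{ijk}), \quad j \in T \tag{3b} \\
& \sum_{k \in K_i} x_{ik} = 1, \quad x_{ik} \in \{0, 1\}, \quad i \in C, \; k \in K_i \tag{3c} \\
& 0 \le \xi_{lj} \le \beta_{l+1,j} - \beta_{l,j}, \quad l = 1, \dots, L_j, \; j \in T \tag{3d}
\end{align*}

Formulation (3) is a mixed-integer linear programming (MILP) problem that can be solved with standard solvers. To compute the slopes and start-points for $2^{-z_j}$ we employ a piecewise linear approximation scheme that minimizes the number of linear segments while limiting the maximum approximation error to a desired value, as proposed in [19].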
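The segment-filling argument can be checked numerically with a short Python sketch. It uses uniformly spaced breakpoints, which is an assumption made for simplicity and differs from the error-bounded scheme of [19]; the function names are also hypothetical. The constant 1.0 added in `phi` is the value of $2^{-z}$ at $z = 0$, an offset that does not influence which solution is optimal.

```python
def build_segments(z_max, n_segments):
    """Uniform breakpoints beta and chord slopes alpha of 2**(-z) on [0, z_max]."""
    betas = [z_max * l / n_segments for l in range(n_segments + 1)]
    alphas = [
        (2.0 ** -betas[l + 1] - 2.0 ** -betas[l]) / (betas[l + 1] - betas[l])
        for l in range(n_segments)
    ]
    return betas, alphas


def fill_segments(z, betas):
    """Split z into xi_1..xi_L, filling each segment fully before the next."""
    xis = []
    remaining = z
    for l in range(len(betas) - 1):
        width = betas[l + 1] - betas[l]
        xi = min(max(remaining, 0.0), width)
        xis.append(xi)
        remaining -= xi
    return xis


def phi(z, betas, alphas):
    """Piecewise linear approximation of 2**(-z), including the offset 2**0 = 1."""
    xis = fill_segments(z, betas)
    return 1.0 + sum(a * x for a, x in zip(alphas, xis))


betas, alphas = build_segments(4.0, 4)
approx = phi(1.5, betas, alphas)
exact = 2.0 ** -1.5
```

With four unit-width segments the slopes are -0.5, -0.25, -0.125 and -0.0625, so they increase from the first segment to the last. A minimizing solver therefore prefers to fill the first segment before the second, which is exactly the ordering that `fill_segments` reproduces by hand. Because the chords of a convex function lie above it, the approximation never underestimates the true miss-detection term.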

Experimental Setup
To evaluate the proposed model and optimization algorithm we have developed a network of smart cameras based on the Raspberry Pi single-board computer [6]. Each Raspberry Pi is connected to a webcam that is mounted on a motorized two degrees-of-freedom (DoF) pan-tilt stage, as shown in Fig. 3. The servo motors are controlled by the Raspberry Pi and the control electronics using pulse width modulation (PWM). Communication between the camera stations is realized via a dedicated local Wi-Fi network. Each camera station is also fitted with programmable LEDs that indicate the status of the system. The cameras are also able to calculate an estimate of a target's position in a global reference system using geometric information and the current angle configuration. The target objects were remote-controlled cars. For this reason, we trained an image classifier capable of detecting cars using the Cascade Object Detection Algorithm [20] with Local Binary Pattern (LBP) features, which is available in the OpenCV computer vision library [4]. The training set was constructed using the database from [1] and was enhanced with additional sample images, for a total of 800 positive and 3200 negative samples. The experiments were conducted in non-controlled environments with ambient light. For the camera detection model we employed a 3-zone approach, and the detection probabilities were 90% for the proximal zone, 50% for the intermediate zone, and 20% for the distant one. A different number of zones with their own detection probabilities can be employed, however, depending on the application scenario and operating environment.

Experimental Results
The extracted values and model were used to configure the camera stations for the experiments. Each station runs a supervised learning machine trained to detect the object of interest, and communicates wirelessly with a central server that runs the optimization algorithm and transmits back the new configuration parameters. Information exchanged between the stations includes a notification with the camera's ID each time a target is detected, the target's coordinates (derived from its position in the image and the joint rotations of the pan-tilt stage), and the detection probability for the target (corresponding to the spatial zone in which it was detected). The targets were placed in various positions within the field, and the cameras then calculated the corresponding detection probabilities and available configurations depending on the targets. The central server received all the data and computed new pan and tilt angles for each camera that maximized the network detection performance based on the outcome of the optimization algorithm. Subsequently, we calculated the corresponding detection probabilities achieved for each target.
The optimization algorithm outcomes were verified using the three-camera setup and up to four car targets in the monitored area. Table 1 shows, for a specific scenario, the detection probabilities achieved by individual cameras as well as the overall combined probabilities for each target. It also shows the expected value according to the detection model from Section 3.1 and the actual measured detection probability. In this particular example we can observe how the optimization algorithm operates in order to produce a solution. Cameras 1 and 2 are configured to focus on targets 1 and 3, as they add more value towards maximizing the detection probability. Also notice that the measured values are indeed close to the expected ones. Minor discrepancies are due to the fact that the cameras may detect the target at the same time instance, in which case the combined value will be lower than the sum, or at different times, in which case the value will approach their sum.

Simulation Results
The simulation scenarios involved a square area where targets were generated at random positions and moved along predetermined structured paths. An equal number of cameras is placed at each side of the square field, and we assume that there are no obstacles in the area. We performed simulation studies for different numbers of targets (ranging from 5 to 20, with a step of 5) and cameras (ranging from 4 to 16, with a step of 4). For each combination of targets and cameras we ran 1000 different scenarios and averaged the results across all runs. First, Fig. 5 shows the combined detection probability (sum of all combined target probabilities) for all network cameras. Second, Fig. 6 shows the effective number of targets that are covered by the network of cameras (i.e., targets detected with a probability greater than zero). Together these two figures illustrate how the algorithm behaves with an increasing number of cameras and targets. As expected, with few cameras and a high number of targets the network is not able to fully cover all the targets. In such a case the optimization algorithm configures the cameras to focus on the targets with a high detection probability. As the number of cameras increases, the optimization approach manages to find solutions that maximize the overall detection performance and cover all the targets; it also tries to increase overlaps between cameras so that the miss-detection probability is reduced. Another important result stems from Fig. 7, which shows the minimum detection probability over all targets: through this analysis we can determine how many cameras must be placed in an area in order to guarantee that a target will be detected with a given probability. Again notice that as the number of cameras increases and more targets are covered by multiple cameras, the minimum probability with which a target can be detected also increases.
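The three quantities discussed above (the sum of combined detection probabilities, the number of covered targets, and the minimum detection probability) can be computed from the per-camera detection probabilities of a chosen configuration, as in the following Python sketch. The input values and the function name `network_metrics` are illustrative assumptions and are not simulation results from the paper.

```python
def network_metrics(p):
    """Summarize one scenario.

    p[i][j] is the detection probability of camera i for target j under the
    configuration selected for that camera (0.0 if the target is not in view).
    """
    n_targets = len(p[0])
    combined = []
    for j in range(n_targets):
        miss = 1.0
        for cam in p:
            miss *= 1.0 - cam[j]  # independent detections assumed
        combined.append(1.0 - miss)
    return {
        "combined_sum": sum(combined),
        "covered_targets": sum(1 for c in combined if c > 0.0),
        "min_probability": min(combined),
    }


# Two cameras, three targets; the second target is seen by no camera.
metrics = network_metrics([[0.9, 0.0, 0.5], [0.5, 0.0, 0.2]])
```

Here the combined probabilities are 0.95, 0.0 and 0.6, so the sum is 1.55, two targets are covered, and the minimum detection probability is zero because one target remains unobserved.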

CONCLUDING REMARKS
This paper presented research towards improving the detection performance of VSNs consisting of low-cost embedded smart cameras. Through the utilization of a model to characterize the detection behaviour of smart cameras and an optimization algorithm that can make use of such information, we were able to improve the overall detection performance by reconfiguring the cameras in the network. We have evaluated the proposed model and optimization algorithm through experiments using real smart cameras, as well as through simulation studies.
The effort going forward will be on identifying possible improvements on the proposed solution. Some issues that require further research concern the efficient and optimal identification of the possible configurations that a camera can have, and how to achieve fairness so that no target remains uncovered. Finally, it is worth exploring a distributed implementation of the optimization algorithm so that it can be run on the cameras themselves and thus reduce communication overheads.

ACKNOWLEDGMENTS
This work was supported in part by the ERC Advanced Grant "Fault-Adaptive", ERC grant agreement no 291508.